We will use numpy.random simulations to visualize and learn about two crucial concepts in AB Testing:
p-value: Statistic used in AB Testing to declare winners
Statistical Power: Strength of an AB Test Design
By the end of this article, you'll be able to use NumPy to design and analyze traditional Significance Tests, naturally building your intuition for probabilistic thinking.
Significance Test
A significance test assumes that test conditions are identical. After the test, we calculate a p-value.
Definition: The p-value estimates the probability of seeing our experimental result (or one more extreme), given equality between our two test conditions.
If the p-value is low enough, we are compelled to conclude that our assumption of equality of test conditions is faulty. This is referred to as a Statistically Significant result.
Our First Fixed AB Test
Let’s imagine a test of two websites A (Control) & B (Treatment) where each site gets 10,000 visitors. Below are conversion rate results for each site. Is it time to celebrate?
To calculate p-values manually, we simulate results with numpy.random.binomial by randomly sampling from two identical websites, both converting at 50% with 10,000 samples each. We repeat the simulation 500,000 times for each website.
Code
from numpy.random import binomial, seed

number_trials_per_condition = int(1e4)
conversion_rate = 0.50
NUMBER_SIMULATIONS = int(5e5)
BLUE = "#0177c9"

seed(42)

# Simulate conversions for the first website (single batch)
conversions_1 = binomial(
    n=number_trials_per_condition, p=conversion_rate, size=NUMBER_SIMULATIONS
)

# Simulate conversions for the second website (separately generated)
conversions_2 = binomial(
    n=number_trials_per_condition, p=conversion_rate, size=NUMBER_SIMULATIONS
)

# Compute differences
conversion_differences = conversions_2 - conversions_1
Central Limit Theorem: Watch how normality of averages is revealed with enough simulations.
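As a quick numeric check on this claim (my own sketch, separate from the article's code, with an assumed seed): simulate many websites and compare the spread of their average conversion rates to what the normal approximation predicts.

```python
import numpy as np

rng = np.random.default_rng(7)  # assumed seed, not the article's
n = 10_000   # visitors per simulated website
p = 0.50     # conversion rate

# Each entry is the average conversion rate of one simulated website
means = rng.binomial(n, p, size=100_000) / n

# The CLT says these averages are approximately Normal around p with
# standard deviation sqrt(p * (1 - p) / n) = 0.5%
theoretical_sd = np.sqrt(p * (1 - p) / n)
empirical_sd = means.std()

# ~95% of averages should land within ±1.96 standard deviations
share_within = np.mean(np.abs(means - p) < 1.96 * theoretical_sd)
```

The empirical spread matches the theoretical 0.5% closely, which is why the scatter plots below settle into a bell shape as the number of simulations grows.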
Website A
Code
from math import ceil

import pandas as pd
import plotly.express as px
import plotly.subplots as sp
from numpy import (array, array_equal, cumsum, linspace, max, mean, min,
                   percentile, where)

# Store all posteriors
all_posteriors = (
    array([conversions_1, conversions_2, conversion_differences])
    / number_trials_per_condition
)

number_simulations_list = [
    # 500,
    1000,
    5000,
    100000,
]

all_figs = []
count = 1
for posterior in all_posteriors:
    df = pd.DataFrame(posterior, columns=["value"])
    running_total_df = df.copy()

    # Count occurrences and create a running total
    running_total_df["count"] = running_total_df["value"].map(
        running_total_df["value"].value_counts()
    )
    running_total_df["running_total"] = running_total_df.groupby("value").cumcount() + 1

    # Rearranging to match the running total of occurrences
    running_total_df = running_total_df[["value", "count", "running_total"]]

    subplot_titles = tuple([f"# sims = {value:,}" for value in number_simulations_list])
    number_simulations_list_length = len(number_simulations_list)
    number_cols = number_simulations_list_length
    number_rows = ceil(number_simulations_list_length / number_cols)

    BLUE = "#82b2e0"
    RED = "#bd0707"

    fig = sp.make_subplots(
        rows=number_rows,
        cols=number_cols,
        subplot_titles=subplot_titles,
        vertical_spacing=0.1,
    )
    for j in range(len(number_simulations_list)):
        number = number_simulations_list[j]
        fig_0 = px.scatter(
            running_total_df[0:number],
            x="value",
            y="running_total",
            color_discrete_sequence=[BLUE],
        )
        row = 1 + int(j / number_cols)
        col = 1 + j - (row - 1) * number_cols
        fig.add_trace(fig_0.data[0], row=row, col=col)

    x_ticks = (
        [0.485, 0.5, 0.515] if count != len(all_posteriors) else [-0.02, 0, 0.02]
    )
    for j in range(len(number_simulations_list) + 1):
        row = 1
        col = j
        fig.update_xaxes(
            tickformat=".1%",
            tickvals=x_ticks if j > 0 else [int(df["value"].iloc[0])],
            row=row,
            col=col,
        )
    fig.update_layout(height=300)
    all_figs.append(fig)
    count += 1

all_figs[0].show()
Figure 1: Website Conversion = 50% with 10k samples per test condition
Website A (Copy)
Figure 2: Simulations of Website A (Copy) aren’t copies of Website A simulations
While both websites convert at 50% on average, their results vary: sometimes A (Copy) outperforms A, and sometimes A comes out ahead.
Difference of Website A vs Website A
Figure 3: Differences of means reveal variance even between identical websites
When we difference results from A vs A (Copy), we see that the average difference is centered at 0.00% but can vary by as much as ±2%.
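That ±2% spread also follows from a back-of-the-envelope normal approximation. The sketch below (my own check, assuming two independent binomial conditions with the parameters above) shows where the figure comes from:

```python
import numpy as np

n = 10_000  # visitors per condition
p = 0.50    # conversion rate of both A and A (Copy)

# Variances of independent averages add, so the difference of means has
# standard deviation sqrt(2 * p * (1 - p) / n) ≈ 0.707%
sd_diff = np.sqrt(2 * p * (1 - p) / n)

# ~95% of A/A differences land within ±1.96 sd (about ±1.4%),
# and virtually all within ±3 sd (about ±2.1%) -- hence the ±2% spread
central_95 = 1.96 * sd_diff
nearly_all = 3 * sd_diff
```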
Our First P-Value
Recall that our initial A/B test showed an Observed Treatment Effect = 1.09% in favor of B. Let’s zoom into our 500,000 simulated A/A tests and highlight the area where differences are at least 1.09%.
Code
import plotly.express as px

# Observed Treatment Effect from the first A/B test (1.09%)
avg_difference = 0.0109

# Define a new column for color
running_total_df["Area"] = running_total_df["value"].apply(
    lambda x: "p-value" if x > avg_difference else "1 - p-value"
)

fig_0 = px.scatter(
    running_total_df,
    x="value",
    y="running_total",
    color="Area",  # Use the new color column
    color_discrete_map={
        "p-value": "green",
        "1 - p-value": BLUE,
    },  # Keep your previous blue color
)
fig_0.update_layout(
    hovermode=False,
    xaxis=dict(tickformat=".1%"),
    legend=dict(
        font=dict(size=16),  # Increase legend text size
        itemsizing="constant",  # Keeps symbol size larger
    ),
)
fig_0.update_yaxes(title="", showticklabels=False)
fig_0.update_xaxes(title="Difference of Means")
fig_0.show()
Figure 4: 6.0% of simulated A/A experiments fall beyond 1.09%
Only 30,157 out of our 500,000 simulated A/A tests breached 1.09%.
Assuming our first AB Test is actually an A/A test, the chance of seeing a difference at least this extreme is 0.06.
In other words, p-value = 0.06.
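The same number drops out of a direct computation. This sketch re-simulates the A/A differences with NumPy's newer Generator API (an assumed seed, different from the article's, so the exact count of 30,157 won't reproduce, but the share lands close to 0.06):

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed; counts will differ slightly
n = 10_000
p = 0.50
sims = 500_000
observed_effect = 0.0109  # Observed Treatment Effect from the A/B test

# Differences of means across simulated A/A experiments
diffs = (rng.binomial(n, p, size=sims) - rng.binomial(n, p, size=sims)) / n

# p-value: share of A/A simulations at least as extreme as what we observed
p_value = (diffs >= observed_effect).mean()
```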
Proper P-Value Practice
Recall that a p-value estimates the chance of seeing our experimental result (or one more extreme), given equality between test conditions. If it's low enough, we throw away our assumption of equality of test conditions.
But that’s it. It doesn’t tell you how accurate your Observed Treatment Effect is.
Significance Threshold
How unlikely must your experiment be (low p-value) to reject your initial assumption of equality? Traditionally, a Significance Threshold of α = 0.10 or α = 0.05 is used.
α controls the rate of False Positives (Type I Error). Choose wisely.
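We can watch α do exactly that. In this minimal sketch (my own, with an assumed seed and 100,000 simulations), every A/A experiment flagged beyond a two-sided normal-approximation cutoff is by construction a false positive, and roughly 5% get flagged:

```python
import numpy as np

rng = np.random.default_rng(3)  # assumed seed
n = 10_000
p = 0.50
alpha = 0.05

# Two-sided cutoff via the normal approximation (1.96 ≈ 97.5th percentile of Z)
sd_diff = np.sqrt(2 * p * (1 - p) / n)
cutoff = 1.96 * sd_diff

# In A/A experiments there is no true effect, so every flagged result
# is a False Positive; their share should be close to alpha
diffs = (rng.binomial(n, p, size=100_000) - rng.binomial(n, p, size=100_000)) / n
false_positive_rate = np.mean(np.abs(diffs) > cutoff)
```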
Our First Test Design
We learned how to use p-values to analyze the results of an experiment. But how do we design a quality experiment in the first place? We need to simulate two distributions.
Null Distribution: The range of expected differences when our two test conditions are equal, an A/A test.
Alternative Distribution: The range of expected differences when our test effect is present, an A/B test.
We will use these distributions to determine how many samples per test condition we need for a quality experiment.
Null Distribution
We use our previous 500,000 A/A simulations, with Baseline = 50% and 10k visitors per condition.
With our Significance Threshold set at α = 0.05, future experiments that fall into the red sections (2.5% in each tail) are so atypical that we will be forced to reject the assumption that A = B.
Since 5% of A/A simulations fall into this region, our False Positive Risk is 5%. We can lower α to reduce this probability, but as you will see later, that comes at a cost.
Code
import plotly.express as px
import plotly.graph_objects as go
from numpy import percentile

# Define alpha level (one tail of the two-sided test; total α = 0.05)
alpha = 0.025

# Compute Significance Threshold boundary
rejection_difference = percentile(all_posteriors[2], 100 - alpha * 100)

# Define a new column for color based on the Significance Threshold
running_total_df["Area"] = running_total_df["value"].apply(
    lambda x: "Stats Sig" if abs(x) > rejection_difference else "Retain Assumption"
)

# Create scatter plot
fig_0 = px.scatter(
    running_total_df,
    x="value",
    y="running_total",
    color="Area",  # Use the new color column
    color_discrete_map={
        "Stats Sig": "red",
        "Retain Assumption": BLUE,
    },
)

# Add vertical lines for Significance Thresholds
for x_pos, text in zip(
    [rejection_difference, -rejection_difference],
    [f"{rejection_difference:.2%}", f"{-rejection_difference:.2%}"],
):
    fig_0.add_shape(
        go.layout.Shape(
            type="line",
            x0=x_pos,
            x1=x_pos,
            y0=running_total_df["running_total"].min(),
            y1=running_total_df["running_total"].max(),
            line=dict(color="black", width=2, dash="dash"),
        )
    )
    # Add text annotation next to the vertical line
    fig_0.add_annotation(
        x=x_pos,
        y=running_total_df["running_total"].max(),
        text=text,
        showarrow=False,
        yshift=10,  # Shift text slightly above the line
        font=dict(size=14, color="black"),
    )

# Customize layout
fig_0.update_layout(
    hovermode=False,
    xaxis=dict(tickformat=".1%"),
    legend=dict(
        font=dict(size=16),  # Increase legend text size
        itemsizing="constant",  # Keeps symbol size larger
    ),
)

# Hide y-axis title and labels
fig_0.update_yaxes(title="", showticklabels=False)
fig_0.update_xaxes(title="Difference of Means")

# Show figure
fig_0.show()
Figure 5: 95% of simulated A/A tests fall into (-1.38%, 1.38%)
Alternative Distribution
The team worked really hard and developed a new website that potentially converts at 51% (MDE = 51/50 = 1.02, a 2% relative lift). What is the probability our experiment identifies a winner (Statistical Power)?
In order to calculate this, we need to simulate two websites:
Control Website converting at 50% with 10,000 visitors
Treatment Website converting at 51% with 10,000 visitors
The resulting differences of means will be our Alternative Distribution.
Code
from numpy.random import binomial, seed

number_trials_per_condition = int(1e4)
conversion_rate = 0.50
treatment_rate = 0.51
# BLUE = "#0177c9"

seed(1)

# Simulate conversions for the first website (single batch)
control_conversions_alt = binomial(
    n=number_trials_per_condition, p=conversion_rate, size=NUMBER_SIMULATIONS
)

# Simulate conversions for the second website (separately generated)
treatment_conversions = binomial(
    n=number_trials_per_condition, p=treatment_rate, size=NUMBER_SIMULATIONS
)

# Compute differences
conversion_differences_alt = treatment_conversions - control_conversions_alt

df = pd.DataFrame(conversion_differences_alt, columns=["value"])
running_total_df = df.copy()

# Count occurrences and create a running total
running_total_df["count"] = running_total_df["value"].map(
    running_total_df["value"].value_counts()
)
running_total_df["running_total"] = running_total_df.groupby("value").cumcount() + 1
running_total_df["value"] = running_total_df["value"] / number_trials_per_condition

# Rearranging to match the running total of occurrences
running_total_df = running_total_df[["value", "count", "running_total"]]

running_total_df["Area"] = running_total_df["value"].apply(
    lambda x: "Stats Sig" if x > rejection_difference else "Not Stats Sig"
)

alt_fig = px.scatter(
    running_total_df,
    x="value",
    y="running_total",
    color="Area",  # Use the new color column
    color_discrete_map={
        "Stats Sig": "green",
        "Not Stats Sig": BLUE,
    },  # Keep your previous blue color
)
alt_fig.update_layout(
    hovermode=False,
    xaxis=dict(tickformat=".1%"),
    legend=dict(
        font=dict(size=16),  # Increase legend text size
        itemsizing="constant",  # Keeps symbol size larger
    ),
)
alt_fig.add_shape(
    go.layout.Shape(
        type="line",
        x0=rejection_difference,
        x1=rejection_difference,
        y0=running_total_df["running_total"].min(),
        y1=running_total_df["running_total"].max(),
        line=dict(color="black", width=2, dash="dash"),
    )
)

# Add text annotation next to the vertical line
alt_fig.add_annotation(
    x=rejection_difference,
    y=running_total_df["running_total"].max(),
    text=f"Significance Threshold = {rejection_difference:.2%}",
    showarrow=False,
    yshift=10,  # Shift text slightly above the line
    font=dict(size=14, color="black"),
)
alt_fig.update_yaxes(title="", showticklabels=False)
alt_fig.update_xaxes(title="Difference of Means")
alt_fig.show()

share_of_winners = sum(
    where(
        conversion_differences_alt / number_trials_per_condition > rejection_difference,
        1,
        0,
    )
)
Figure 6: Only 29% of simulated A/B tests manage to exceed 1.38%
Statistical Power: This experiment design is lousy. Though we programmed condition B to convert at 51%, only 28.79% of our 500,000 A/B simulations fall beyond our Significance Threshold.
We need more samples to make our experiment more reliable.
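Before scaling up, a normal-approximation cross-check (my own sketch; 1.96 and the complementary error function stand in for exact normal quantiles and tail areas) recovers that ~29% power figure analytically:

```python
from math import erfc, sqrt

n = 10_000
p_control, p_treatment = 0.50, 0.51
true_effect = p_treatment - p_control  # 1.00%

# Standard deviation of the difference of means under the alternative
sd_diff = sqrt(p_control * (1 - p_control) / n + p_treatment * (1 - p_treatment) / n)

# Two-sided threshold implied by the null (close to the simulated 1.38%)
threshold = 1.96 * sd_diff

# Power ≈ P(Z > (threshold - effect) / sd), via the complementary error function
z = (threshold - true_effect) / sd_diff
power = 0.5 * erfc(z / sqrt(2))
```

Because the true effect (1%) sits well inside the threshold (≈1.39%), most A/B results fall short of significance, matching the simulation.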
Increase Statistical Power
In order for our experiment to have a better chance of finding a winner, we need to raise its Statistical Power by increasing the number of samples in each condition. Let's simulate experiments with 10,000; 20,000; and 40,000 samples per test condition.
Code
from numpy.random import binomial, seed
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from numpy import mean, where, percentile
from IPython.display import display, Markdown

seed(42)
conversion_rate = 0.50
treatment_rate = 0.51
BLUE = "#0177c9"
alpha = 0.025  # Significance level (two-tail)

number_trials_per_condition_list = [int(1e4), int(2e4), int(4e4)]
number_trials_per_condition_list_count = len(number_trials_per_condition_list)

# Create subplot structure with reduced vertical spacing
fig = make_subplots(
    rows=number_trials_per_condition_list_count,
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.15,
)

for i, number_trials_per_condition in enumerate(number_trials_per_condition_list):
    conversions_1 = binomial(
        n=number_trials_per_condition, p=conversion_rate, size=NUMBER_SIMULATIONS
    )
    conversions_2 = binomial(
        n=number_trials_per_condition, p=conversion_rate, size=NUMBER_SIMULATIONS
    )
    conversion_differences = (
        conversions_2 - conversions_1
    ) / number_trials_per_condition

    seed(1)
    control_conversions_alt = binomial(
        n=number_trials_per_condition, p=conversion_rate, size=NUMBER_SIMULATIONS
    )
    treatment_conversions = binomial(
        n=number_trials_per_condition, p=treatment_rate, size=NUMBER_SIMULATIONS
    )
    conversion_differences_alt = (
        treatment_conversions - control_conversions_alt
    ) / number_trials_per_condition

    rejection_difference = percentile(conversion_differences, 100 - alpha * 100)
    share_winner = mean(where(conversion_differences_alt > rejection_difference, 1, 0))

    # Add histograms
    fig.add_trace(
        go.Histogram(
            x=conversion_differences,
            nbinsx=50,
            name="A/A Simulations",
            marker=dict(color="#0177c9"),
            opacity=0.6,
            showlegend=False if i > 0 else True,
        ),
        row=i + 1,
        col=1,
    )
    fig.add_trace(
        go.Histogram(
            x=conversion_differences_alt,
            nbinsx=50,
            name="A/B Simulations",
            marker=dict(color="#ff5733"),
            opacity=0.6,
            showlegend=False if i > 0 else True,
        ),
        row=i + 1,
        col=1,
    )

    # Add vertical rejection lines
    fig.add_vline(
        x=rejection_difference,
        line=dict(color="black", width=3, dash="dash"),
        row=i + 1,
        col=1,
        annotation=dict(
            text=f"Significance Threshold: {rejection_difference:.2%}",
            font=dict(size=14, color="black"),
            arrowhead=2,
            x=rejection_difference,
            y=0.99,
        ),
    )
    fig.add_vline(
        x=-rejection_difference,
        line=dict(color="black", width=3, dash="dash"),
        row=i + 1,
        col=1,
        annotation=dict(
            text=f"Significance Threshold: {-rejection_difference:.2%}",
            font=dict(size=14, color="black"),
            arrowhead=2,
            xanchor="right",
            x=-rejection_difference,
            y=0.99,
        ),
    )
    fig.add_annotation(
        # xref="paper",
        yref="y domain",
        row=i + 1,
        col=1,  # Reference the subplot's domain
        x=-0.035,  # Align text to the left
        y=1.30,  # Exactly at the top of the subplot
        text=f"{number_trials_per_condition:,} samples per test condition",
        showarrow=False,
        font=dict(size=18),
        align="left",
        xanchor="left",  # Ensures left alignment
        yanchor="top",  # Ensures the text is positioned at the top
    )
    fig.update_xaxes(
        row=i + 1,
        col=1,
        title=f"{share_winner:.2%} of A/B Simulations manage to fall beyond the Significance Threshold.",
    )

# Update layout
fig.update_layout(
    # xaxis_title="Difference of Means",
    template="plotly_white",
    height=550,
    hovermode=False,
    barmode="overlay",
    # showlegend=False,
)
fig.update_xaxes(tickformat=".1%")
fig.update_yaxes(showticklabels=False)
# fig.update_xaxes(
#     row=number_trials_per_condition_list_count, col=1, title="Difference of Means"
# )

# Show the figure
fig.show()